BUG: Slice Arrow buffer before passing it to numpy (#40896) #41046
Conversation
Add Arrow buffer slicing before handing it over to numpy, which is needed in case the Arrow buffer contains padding or offset.
Use numpy dtype for bitwidth information. Fix test signature.
Fix mask buffer in test.
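For orientation, here is a minimal sketch of what the fix described in this PR amounts to; this is an illustration, not the exact pandas implementation, and assumes a primitive pyarrow array whose buffers() returns the validity bitmap followed by the data buffer.

    import numpy as np
    import pyarrow as pa

    arr = pa.array([0, 1, 2, None], type=pa.int32())
    np_dtype = np.dtype("int32")

    # buffers() on a primitive array returns [validity bitmap, data buffer].
    _, data_buf = arr.buffers()

    # Arrow pads its allocations (typically to 64 bytes), so the raw buffer
    # length need not be a multiple of the item size; slice out exactly the
    # bytes that belong to the values before wrapping with numpy, otherwise
    # np.frombuffer can raise "buffer size must be a multiple of element size".
    n_bytes = (arr.offset + len(arr)) * np_dtype.itemsize
    data = np.frombuffer(data_buf.slice(0, n_bytes), dtype=np_dtype)
    data = data[arr.offset : arr.offset + len(arr)]  # drop any leading offset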
Thanks for the PR!
There is a …
Yes, I think it would be good to parameterize the test across different dtypes.
Move tests to pandas/tests/arrays/masked/test_arrow_compat.py. Add more dtypes to test.
I moved the test and added parameterization for all numeric types.
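As an illustration, a parameterized fixture along those lines might look like the sketch below; the dtype list and return shape are guesses for illustration, not the exact test code.

    import numpy as np
    import pytest

    @pytest.fixture(params=["uint8", "int32", "int64", "float32", "float64"])
    def np_dtype_to_arrays(request):
        pa = pytest.importorskip("pyarrow")  # skip if pyarrow is unavailable

        np_dtype = np.dtype(request.param)
        pa_type = pa.from_numpy_dtype(np_dtype)
        pa_array = pa.array([0, 1, 2], type=pa_type)
        np_expected = np.array([0, 1, 2], dtype=np_dtype)
        mask_expected = np.array([True, True, True])
        return np_dtype, pa_array, np_expected, mask_expected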
Force-pushed from 2ca741b to 1586d50.
    @pytest.fixture
    def np_dtype_to_arrays(request):
        import pyarrow as pa
can we just add an import_optional_dependency at the top of this file, rather than having skips on every test.
I used pyarrow 0.16.0 as the minimal version, so test_arrow_array no longer gets run with pyarrow 0.15.0. Hope this is appropriate.
    @td.skip_if_no("pyarrow")
    @pytest.mark.parametrize(
        "np_dtype_to_arrays",
can you use the fixture: any_real_dtype or any_numpy_dtype instead
Modify test_arrow_compat.py:
- Use import_optional_dependency to skip the whole module if pyarrow is not available.
- Use any_real_dtype fixture.
Force-pushed from ec58ea4 to 5652869.
    try:
        pa = import_optional_dependency("pyarrow", min_version="0.16.0")
    except ImportError:
        pytestmark = pytest.mark.skip(reason="Pyarrow not available")
To avoid the imports in every test (as you did now), you could also do a

    try:
        import pyarrow as pa
    except ImportError:
        pa = None

(assuming we keep the skip_if_no)
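For context, a hedged sketch of how that pattern combines with the existing decorator; this assumes pandas' td.skip_if_no accepts a min_version argument, and the test name is made up for illustration.

    import pandas.util._test_decorators as td

    try:
        import pyarrow as pa
    except ImportError:
        pa = None

    @td.skip_if_no("pyarrow", min_version="0.16.0")
    def test_pyarrow_available():
        # The decorator skips this test when pyarrow is missing,
        # so pa is guaranteed to be bound here.
        assert pa is not None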
    np_dtype = np.dtype(any_real_dtype)
    pa_type = pa.from_numpy_dtype(np_dtype)

    pa_array = pa.array([0, 1, 2], type=pa_type)
    - pa_array = pa.array([0, 1, 2], type=pa_type)
    + pa_array = pa.array([0, 1, 2, None], type=pa_type)
Maybe good to include a missing value, to ensure there is a bitmask buffer?
Yes, good point!
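The reasoning behind the suggestion can be checked directly in pyarrow: without any nulls, Arrow may not allocate a validity bitmap at all. A small illustrative check:

    import pyarrow as pa

    no_nulls = pa.array([0, 1, 2], type=pa.int32())
    with_null = pa.array([0, 1, 2, None], type=pa.int32())

    # The first buffer is the validity bitmap; pyarrow may leave it as None
    # when no value is null, so only the second array exercises the mask path.
    print(no_nulls.buffers()[0])   # None (no bitmask allocated)
    print(with_null.buffers()[0])  # <pyarrow.Buffer ...>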
Add whatsnew entry.
Enforce creation of Arrow bitmask buffer.
Merge branch 'master' into issue-40896
Force-pushed from 80fb1c7 to 0f3302c.
Force-pushed from 0f3302c to c195089.
doc/source/whatsnew/v1.3.0.rst (outdated)
@@ -882,6 +882,7 @@ ExtensionArray
 - Fixed bug where :meth:`Series.idxmax`, :meth:`Series.idxmin` and ``argmax/min`` fail when the underlying data is :class:`ExtensionArray` (:issue:`32749`, :issue:`33719`, :issue:`36566`)
 - Fixed a bug where some properties of subclasses of :class:`PandasExtensionDtype` where improperly cached (:issue:`40329`)
 - Bug in :meth:`DataFrame.mask` where masking a :class:`Dataframe` with an :class:`ExtensionArray` dtype raises ``ValueError`` (:issue:`40941`)
+- Bug in :func:`pyarrow_array_to_numpy_and_mask` where Numpy raises ``ValueError`` if buffer size does not match multiple of dtype size (:issue:`40896`)
can you change this to a user facing note.
    import pytest

    import pandas.util._test_decorators as td

    import pandas as pd
    import pandas._testing as tm

    try:
just use pytest.importorskip here. i am not sure we can test anything w/o it.
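Presumably this refers to pytest's built-in helper, which skips the whole module at collection time, e.g.:

    import pytest

    # Skips every test in this module when pyarrow is missing
    # or older than the minimum version.
    pa = pytest.importorskip("pyarrow", minversion="0.16.0")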
doc/source/whatsnew/v1.3.0.rst (outdated)
@@ -882,6 +882,7 @@ ExtensionArray
+- Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
- - Bug in :func:`pyarrow_array_to_numpy_and_mask` when converting a pyarrow array who's data buffer size is not a multiple of dtype size (:issue:`40896`)
+ - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
pyarrow_array_to_numpy_and_mask is not a public method, so we should try to avoid mentioning it; hence the suggested rewording above.
Also, I would maybe move this to the I/O section.
I changed that as suggested. Also, let me know if I should rebase to get rid of the small commits.
> Also, let me know if I should rebase to get rid of the small commits.

No need to do that, we will squash everything when merging (well, GitHub does that for us ;))
Thanks for the updates! Looking good to me
Force-pushed from d18922e to 9aa5df6.
@@ -796,6 +796,7 @@ I/O
 - Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
+- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
would not object to actually showing a read_parquet reference here
I don't think that's needed; the original bug report didn't actually involve a plain read of a Parquet file (not sure this can even happen with that), but a dataset scan + filter operation (from any source format).
Yes, I was creating the Arrow table using only the Arrow API; the bug then occurred on the conversion to pandas. I did not actually try to reproduce the bug using read_parquet, so I'm also not sure if it can happen this way.
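As a rough illustration of the pure-Arrow route described here (a hypothetical repro; whether this exact snippet trips the original error depends on buffer layout and pyarrow version):

    import pandas as pd
    import pyarrow as pa

    # Slicing yields an array that shares the parent's (padded) buffers and
    # carries a non-zero offset, the layout the fix guards against.
    sliced = pa.array([0, 1, 2, None, 4], type=pa.int32()).slice(1, 3)
    table = pa.table({"x": sliced})

    # Ask for pandas' nullable Int32 dtype during the conversion.
    df = table.to_pandas(types_mapper={pa.int32(): pd.Int32Dtype()}.get)
    print(df)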
I suppose it could happen when reading a Parquet file using the new dataset API by passing a certain filter. But anyway, I don't think we need to make that explicit; IMO the above error message is clear enough.
@@ -89,12 +86,89 @@ def test_arrow_sliced(data):
     tm.assert_frame_equal(result, expected)


 @pytest.fixture
 def np_dtype_to_arrays(any_real_dtype):
can you type this (mainly the output, for better readability), also adding a doc-string explaining the purpose
This is a fixture in the tests; I don't think we generally ask for annotations there? (e.g. in conftest.py there is no typing)
    Also tests empty pyarrow arrays with non-empty buffers.
    See https://github.com/pandas-dev/pandas/issues/40896
    """
    from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask
can you import at the top (after the pyarrow check is fine)
Should I move the rest of the pandas imports below the pyarrow check as well? Looks a bit weird having them separated by the check.
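For reference, the test body around that import presumably exercises the internal helper roughly as below; this is a sketch assuming pyarrow_array_to_numpy_and_mask returns a (data, mask) pair with mask True at valid positions.

    import pandas._testing as tm
    from pandas.core.arrays._arrow_utils import pyarrow_array_to_numpy_and_mask

    def test_pyarrow_array_to_numpy_and_mask(np_dtype_to_arrays):
        np_dtype, pa_array, np_expected, mask_expected = np_dtype_to_arrays
        data, mask = pyarrow_array_to_numpy_and_mask(pa_array, np_dtype)
        # Compare only the valid positions; masked-out slots are undefined.
        tm.assert_numpy_array_equal(data[mask], np_expected[mask_expected])
        tm.assert_numpy_array_equal(mask, mask_expected)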
thanks @ThomasBlauthQC, I think the checks are running, but can merge on green.
Looks green @jreback. Merge?
@ThomasBlauthQC Thanks a lot for the fix!
Points I'm not sure about:
I'm still new to the internals of pandas and always trying to improve, so feedback of any kind is highly appreciated!